Transformer model
The core architecture behind modern LLMs. It is built around the Attention mechanism; a minimal sketch appears after the links below.
- http://jalammar.github.io/illustrated-transformer/
- https://www.youtube.com/watch?v=-QH8fRhqFHM : GPT is decoder-only (generation); BERT is encoder-only (representation); the original encoder-decoder transformer targeted translation.
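A minimal sketch of scaled dot-product attention in NumPy; the `causal` flag reproduces the decoder-style masking that separates GPT-style generation from BERT-style encoding. The toy shapes and input are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # (seq, seq) similarities
    if causal:
        # decoder-style mask: position i may only attend to positions <= i
        mask = np.triu(np.ones(scores.shape[-2:], dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    return softmax(scores) @ V

# toy self-attention check: 4 tokens, model dimension 8
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x, causal=True)  # GPT-style
print(out.shape)  # (4, 8)
```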
Google’s T5 paper frames every task as text-to-text, giving a unified framework for understanding and training transformer models.
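As a small illustration of the text-to-text framing, the pretrained t5-small checkpoint on Hugging Face accepts task prefixes directly (the prefix follows the paper's public examples; requires `transformers` and `sentencepiece`):

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# every task is encoded as plain text with a task prefix
ids = tok("translate English to German: The house is wonderful.",
          return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))
```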
Tutorials and reviews
- A walkthrough of transformer architecture code by Mark Riedl
- Transformers from scratch
- “Attention”, “Transformers”, in Neural Network “Large Language Models” by Cosma Shalizi
- Understanding Encoder And Decoder LLMs by Sebastian Raschka
- The Illustrated Transformer by Jay Alammar
- Transformer explainer
- The Attention Mechanism in Large Language Models by Luis Serrano
Implementations
See also Implementations
https://huggingface.co/blog/how-to-train shows how to train a transformer model from scratch; a minimal sketch follows below. See also How to pretrain transformer models, or A complete Hugging Face tutorial: how to build and train a vision transformer.
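A compressed sketch of that recipe, assuming the Hugging Face `transformers` and `datasets` libraries; the tiny GPT-2 config, dataset choice, and hyperparameters are placeholders, not the blog post's exact values.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          GPT2Config, GPT2LMHeadModel, Trainer, TrainingArguments)

# reuse an existing tokenizer (the post also covers training one from scratch)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# a deliberately tiny, randomly initialized GPT-2-style model (no pretrained weights)
config = GPT2Config(vocab_size=len(tokenizer), n_layer=4, n_head=4, n_embd=256)
model = GPT2LMHeadModel(config)

ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
ds = ds.map(lambda b: tokenizer(b["text"], truncation=True, max_length=128),
            batched=True, remove_columns=["text"])
ds = ds.filter(lambda ex: len(ex["input_ids"]) > 1)  # drop empty lines

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM objective
args = TrainingArguments(output_dir="tiny-gpt2",
                         per_device_train_batch_size=8,
                         num_train_epochs=1)
Trainer(model=model, args=args, train_dataset=ds, data_collator=collator).train()
```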
- gpt-fast: https://github.com/pytorch-labs/gpt-fast
- The Genius of DeepSeek’s 57X Efficiency Boost [MLA] (video on Multi-head Latent Attention)
Applications
Transformers are also used outside language modeling, including in Computer vision (Vision transformer) and Reinforcement learning (Decision transformer).
Internal workings
See Sanford2024transformers for the connection to Massively parallel computation.
Teh2025solving studies whether transformers can solve an empirical Bayes problem.
Cohen2025spectral studies how transformer models can predict the Shortest path on a graph.
Circuit analysis
Park2025does identifies temporal heads by performing circuit analysis.
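Circuit analysis of this kind starts from per-head activation and attention patterns. A generic starting point, not Park2025does's actual pipeline: dump GPT-2's attention maps via `output_attentions=True` and inspect individual heads (the layer/head indices below are illustrative).

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2", output_attentions=True)
model.eval()

ids = tok("In 1999, the tallest building was", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids)

# out.attentions: one tensor per layer, each (batch, n_head, seq, seq)
layer, head = 5, 1  # which head to inspect; purely illustrative
attn = out.attentions[layer][0, head]
print(attn.shape)  # (seq, seq): row i = where token i attends
```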